Cell Systems — Latest Matching Preprints

1

Joint linear modeling of transcriptomics and proteomics is predictive of cancer metastasis

Sharma, R.; Meimetis, N.; Begzati, A.; Nagar, S. D.; Kellman, B.; Baghdassarian, H. M.

2025-02-20 systems biology 10.1101/2025.02.15.638428 medRxiv

Top 0.1%

48.9%

A central goal of conducting omics measurements is to understand how molecular features inform higher-order cell- and tissue-level phenotypes. In particular, multi-omics offers insights into how information encoded by the genome is coordinated through biological layers, resulting in functional outputs. Due to myriad post-transcriptional regulatory processes, the coordination between mRNA and protein cannot be simply reduced to gene-wise correlation. Yet, both modalities have been shown to serve as representations of biological state, and multi-omics integration has been used to improve these representations. Multi-omics approaches typically do not focus on how mRNA and protein features coordinate, but rather use the additional information for improved prediction or feature selection. Here, instead, we show that standard linear machine learning models provide an understanding of transcriptomic and proteomic coordination in the context of a biological phenotype of interest, in this case cancer metastasis. We find that, in the context of metastasis, a select subset of proteomic features--reflecting a more concentrated signal relative to the broadly distributed transcriptomic signal--offers additional information to that encoded by transcriptomics, as demonstrated by improved model performance when integrating the two modalities and the relative feature importance of proteomics. Top features show a depletion of gene-product overlap across modalities, indicating that the model primarily leverages instances in which the two modalities are providing complementary information with respect to phenotype. However, in instances when both modalities are selected for a given gene product, there is high information consistency that synergistically bolsters phenotype prediction.
Altogether, by using model fits that relate both modalities to phenotype, we observe a nuanced coordination of protein and mRNA, in which both modalities tend to provide consistent information about phenotype, yet benefits remain to incorporating a combination of both complementary and reinforcing signals across modalities.

2

Similarity bias from consensus perturbational signatures from the L1000 Connectivity Map

Smith, I.; Scott, K.; Haibe-Kains, B.

2024-01-07 bioinformatics 10.1101/2022.01.24.477615 medRxiv

Top 0.1%

45.3%

In recent years, high-throughput perturbational datasets have become an important tool for rapidly characterizing the function of large collections of chemical compounds. To overcome the biological and technical noise in these experiments, researchers have used consensus signatures - averages of multiple experiments - to summarize the effects of perturbations. In this work, we demonstrate that consensus signatures on the L1000 Connectivity Map show a pervasive similarity bias: as more signatures are averaged, the resulting consensus signatures are increasingly similar to each other, regardless of whether the signatures are related. We show that the distribution of Pearson's correlation changes as a function of the number of signatures averaged. The artifactual similarity bias is caused by skewness in the data and a consequence of using median normalization on non-normal distributions. Furthermore, we show that mean normalization can partly remedy this similarity bias and improve power to identify associations. The similarity bias introduced by consensus signatures is an important potential confounder of analysis of perturbational datasets, and our practical solution could easily be applied by practitioners in the field to improve the analysis of the L1000 Connectivity Map.
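
The mechanism described in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' analysis and uses no L1000 data. It assumes each gene's noise is exponentially distributed (a stand-in for any skewed distribution) with a gene-specific scale, and it centers each gene by its theoretical median or mean instead of estimating them from samples. All function names and parameters are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def consensus(scales, n_signatures, center, rng):
    """Average n_signatures unrelated signatures, one value per gene.

    Each gene is pure skewed noise (exponential with mean `s`), centered
    by its theoretical median (s * ln 2) or its theoretical mean (s).
    """
    out = []
    for s in scales:
        offset = s * math.log(2) if center == "median" else s
        total = sum(rng.expovariate(1.0 / s) - offset for _ in range(n_signatures))
        out.append(total / n_signatures)
    return out


def unrelated_similarity(n_signatures, center, seed=0, n_genes=400):
    """Correlation between two consensus signatures that share no real signal."""
    rng = random.Random(seed)
    scales = [rng.uniform(0.5, 2.0) for _ in range(n_genes)]  # assumed gene-specific skew
    a = consensus(scales, n_signatures, center, rng)
    b = consensus(scales, n_signatures, center, rng)
    return pearson(a, b)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, unrelated_similarity(n, "median"), unrelated_similarity(n, "mean"))
```

Under these assumptions, median centering leaves a residual per-gene offset (mean minus median) that survives averaging. As the noise shrinks with more signatures, that shared offset dominates and unrelated consensus signatures start to correlate. Mean centering removes the offset in this toy setting, which is consistent with the partial remedy the abstract reports.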

3

Noise-guided tuning of synthetic protein waves in living cells

Bolshakov, D. T.; Weix, E. W. Z.; Galateo, T. M.; Rajasekaran, R.; Coyle, S. M.

2025-03-21 synthetic biology 10.1101/2025.03.21.644572 medRxiv

Top 0.1%

40.9%

Biological systems use protein circuits to organize cellular activities in space and time, but engineering synthetic dynamics is challenging due to stochastic effects of genetic and biochemical variation on circuit behavior. Genetically encoded oscillators (GEOs) built from bacterial MinDE-family ATPase and Activator modules generate fast orthogonal protein waves in eukaryotic cells, providing an experimental model system for genetic and biochemical coordination of synthetic protein dynamics. Here, we use budding yeast to experimentally define and model phase portraits that reveal how the breadth of frequencies and amplitudes available to a GEO are genetically controlled by ATPase and Activator expression levels and noise. GEO amplitude is encoded by ATPase absolute abundance, making it sensitive to extrinsic noise on a population level. In contrast, GEO frequency is remarkably stable because it is controlled by the Activator:ATPase ratio and thus affected primarily by intrinsic noise. These features facilitate noise-guided design of different expression strategies that act as filters on GEO waveform, enabling us to construct clonal populations that oscillate at different frequencies as well as independently tune frequency and amplitude variation within a single population. By characterizing 169 biochemically distinct GEOs, we provide a rich assortment of phase portraits as starting points for application of our waveform engineering approach. Our findings suggest noise-guided design may be a valuable strategy for achieving precision control over dynamic protein circuits.

4

Mechanotransduction-Aware Causal Omics on Tissue Scaffolds: A Controlled Mechanochemical Framework for Identifying Disease Genes Beyond Pure Omics Analysis

Xu, T.; Hu, Z.; Sun, X.; Xiong, M.

2026-04-22 bioinformatics 10.64898/2026.04.19.719528 medRxiv

Top 0.1%

40.9%

Omics-based disease-gene discovery is typically performed as if molecular states evolve independently of tissue mechanics. Most current pipelines analyze transcriptomic or multimodal molecular data alone and identify abnormal genes using differential expression, latent trajectories, or association-based recovery under treatment. However, in mechanically active tissues, gene expression is shaped not only by internal regulatory networks but also by mechanotransduction arising from strain, curvature, force transmission, and scaffold geometry. This raises a fundamental question: should disease-gene identification in tissues be treated as a pure omics association problem, or as a causal mechanochemical inference problem? We introduce a mechanotransduction-aware causal omics framework on a Cosserat tissue scaffold. Gene expression evolves through intrinsic regulatory dynamics, spatial diffusion, external control, and a mechanotransduction term driven by scaffold mechanics. To distinguish causation from association, we define a hidden mechano-drug rescue channel in the true data-generating system and compare predictive models that either include or omit mechanotransduction. We show that association-based rankings can incorrectly elevate downstream homeostatic or repair genes, even when the disease gene is the true direct mechanochemical target. By contrast, a causal ranking based on reconstruction of the direct mechanotransduction intervention effect correctly identifies the disease gene as the strongest beneficiary. These results argue that popular pure-omics analysis is insufficient for disease-gene discovery in mechanically structured tissues. Mechanotransduction should be modeled as part of the causal structure of tissue biology rather than treated as a secondary covariate or omitted entirely.

5

Gene-First Identity Construction for Robust Cell Identification in Single-Cell Transcriptomics

Yang, L.; Huang, Z.; Cai, J.; Xin, H.

2026-02-26 bioinformatics 10.64898/2026.02.25.707869 medRxiv

Top 0.1%

40.6%

The precise delineation of cell types is fundamental to single-cell transcriptomics, yet current clustering pipelines often violate an axiomatic principle: hierarchical consistency. Existing methods measure cell-to-cell distances within a fixed global feature space, disregarding the fact that biological distinctions are inherently context-dependent: lineage separation requires different gene programs than subtype resolution. Mathematically, this implies that the similarity metric itself should not be a static functional, but a pair-dependent energy functional evaluated within a specific Hilbert subspace determined by the biological comparison at hand. The challenge lies in the fact that allowing pair-dependent metrics typically destroys the global geometric consistency required for downstream analysis, unless the family of Hilbert subspaces is given strong biological structure. To resolve this geometric dilemma, we introduce GeCCo (Gene Co-expression Constructed identity), which constructs identities by projecting cells onto a rigorously derived hierarchy of gene programs. To construct this hierarchy, GeCCo first quantifies Boolean regulatory logic via the φ coefficient, and subsequently employs a greedy topological inference to organize genes based on their synergistic and antagonistic relationships. Benchmarking on human immune atlases demonstrates that GeCCo achieves superior hierarchical consistency, ensuring that globally inferred cell identities rigorously match locally refined subtypes. Furthermore, in pancreatic endocrine progenitors, GeCCo resolves a hidden mitotic bridge state, suggesting a concentrated division phase prior to differentiation. Ultimately, GeCCo shifts the paradigm from ad hoc clustering to programmatic cell typing, offering a mathematically grounded framework for scalable atlases of cellular discovery.
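
For readers unfamiliar with the phi (φ) coefficient named in this abstract, the sketch below computes the standard statistic for two binary vectors, for example the on/off status of two genes across cells. It only illustrates the textbook formula. How GeCCo binarizes expression or uses the coefficient downstream is not stated in the abstract, and the function name is hypothetical.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length binary sequences.

    Built from the 2x2 contingency table:
    phi = (n11*n00 - n10*n01) / sqrt(row and column totals multiplied together).
    """
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a constant vector carries no association; a convention chosen here
    return (n11 * n00 - n10 * n01) / denom


if __name__ == "__main__":
    gene_a = [1, 1, 0, 0]
    gene_b = [0, 0, 1, 1]
    print(phi_coefficient(gene_a, gene_a), phi_coefficient(gene_a, gene_b))
```

Values near +1 indicate that two genes tend to be on together, and values near -1 indicate mutually exclusive expression. This is one way to read the "synergistic and antagonistic relationships" the abstract refers to.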

6

Impacts of batch effects on the performance of machine learning classifiers across multiple studies

Raab, P.; Johnson, W. E.; Piccolo, S. R.

2026-06-30 bioinformatics 10.64898/2026.06.24.734352 medRxiv

Top 0.1%

39.9%

Precision medicine relies on accurate and generalizable predictions for patients across the spectrum of human diversity. Because capturing biological heterogeneity requires large sample sizes, researchers must often aggregate data from several experimental batches or independent studies. This integration allows for greater statistical power and diversity than a single study could provide, while avoiding the costs of generating massive new -omics datasets. Predictive models trained on these aggregated data are theoretically better equipped to detect subtle patterns that generalize to new data. However, this potential is frequently undermined by "batch effects"--systematic technical artifacts that can bias model training to predict experimental batches and shadow meaningful biological conditions. Models trained on data with batch effects can exhibit substantially degraded performance when applied to data from new batches. Statistical adjustment methods can mitigate these artifacts while preserving biological signals. To ensure these adjustments actually facilitate generalization, we emphasize the use of external, independent cohorts for rigorous validation. This chapter examines how batch effects impact predictions and compares various adjustment methods.

7

Computational screen of promoter configurations that robustly sense transcription factor dynamics

Shoyer, T. C.; Di Ventura, B.

2026-05-20 systems biology 10.64898/2026.05.19.726340 medRxiv

Top 0.1%

39.1%

Transcription factors (TFs) respond to external stimuli with time-varying changes in activity or localization (TF dynamics), driving differential transcriptional programs. Previous studies indicated that TF dynamics can be decoded at the promoter level in eukaryotes, yet a systematic understanding of robust solutions is lacking. By computationally screening over 10,000 mathematical models of multi-state promoters with various forms of TF-mediated regulation, we identify robust configurations that selectively respond to sustained ("pulse filtering") or pulsatile ("pulse boosting") TF dynamics. Promoters that activate via intermediate states and have negatively regulated deactivation robustly perform pulse filtering. In contrast, robust pulse boosting is achieved by promoters with a TF-mediated refractory state that permits short activation and recovers between pulses. Bifunctional TFs that exert activator- and repressor-like regulation extend the design space for pulse boosting. These results reveal general principles by which promoters interpret TF dynamics and suggest strategies to engineer synthetic systems to exploit them.
Highlights:
- Computational screen of over 10,000 promoter models identifies features that enable promoters to selectively respond to sustained ("pulse filtering") or pulsatile ("pulse boosting") transcription factor (TF) dynamics.
- Promoters that activate via intermediate states and have negatively regulated deactivation robustly perform pulse filtering.
- Promoters with TF-regulated refractoriness robustly perform pulse boosting.
- Promoters regulated by bifunctional TFs extend the design space for pulse boosting.

8

Designing biochemical circuits with tree search

Bhamidipati, P. S.; Thomson, M.

2025-01-29 systems biology 10.1101/2025.01.27.635147 medRxiv

Top 0.1%

38.5%

Discovering biochemical circuits that exhibit a desired behavior is an outstanding problem in biological engineering. The traditional approach of enumerating every possible circuit topology becomes intractable for circuits with more than four components due to combinatorial scaling of the search space. Here, we use Monte Carlo Tree Search (MCTS), a reinforcement learning (RL) algorithm, to optimize circuit topology for a target phenotype by approaching circuit design as a sequence of assembly decisions. Our RL-based design framework, which we call CircuiTree, efficiently and comprehensively finds robust designs for three-component oscillators by prioritizing sparsity. CircuiTree can also infer candidate network motifs from its search results, producing similar results to enumeration. Using parallel MCTS, we scale this workflow up to five components and find that highly fault-tolerant designs use a novel strategy, which we call motif multiplexing. Multiplexed circuits contain many overlapping network motifs that each activate in different mutational scenarios. The evolutionary robustness of multiplexing may explain the ubiquity of multiple sub-oscillators in circadian clock circuits. Overall, CircuiTree provides the first scalable computational platform for designing biochemical circuits.

9

Breaking the Synthesis Barrier for AI-Designed DNA Libraries

Sussex, S.; Borevkovic, E.; Lohmann, F.; Chen, N.; Lüthi, E.; Reddy, S. T.; Krause, A.

2026-07-07 bioengineering 10.64898/2026.07.07.736931 medRxiv

Top 0.1%

38.3%

Designing DNA libraries is a key challenge from drug design to protein engineering and synthetic biology. Modern generative models offer opportunities to navigate the design space and propose specific sequences predicted to be effective in-silico. Designing deterministic libraries of specific sequences is however limited by the cost of DNA synthesis -- the synthesis barrier. In contrast, high-throughput multiplexed screening can measure the function of billions of biological sequences in parallel. Harnessing this technology requires the design of randomized libraries with specific design constraints to achieve low synthesis costs. In practice, such stochastic libraries are often chosen heuristically, sacrificing control for scale. Is there a way to bridge AI-based in-silico sequence design with high-throughput experimentation? In this work, we introduce Policy Gradients for Library Design (PGLD). PGLD uses a synthesis-aware parametrization of stochastic DNA libraries and optimizes them against a specified objective function. This allows for designing massive, controlled libraries without being limited by synthesis costs. We show how PGLD enables lab-in-the-loop design of multi-round high-throughput experiments, and large-scale in-vitro DNA sampling from generative models. Finally, we use PGLD to design a library of ~10^6 unique sequences which is synthesized at a cost of ~700 USD to explore the mutation space of a broadly neutralizing influenza antibody.

10

Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models

Weinstein, E. N.; Amin, A. N.; Grathwohl, W.; Kassler, D.; Disset, J.; Marks, D.

2021-10-29 synthetic biology 10.1101/2021.10.28.466307 medRxiv

Top 0.1%

37.7%

Generative probabilistic models of biological sequences have widespread existing and potential applications in analyzing, predicting and designing proteins, RNA and genomes. To test the predictions of such a model experimentally, the standard approach is to draw samples, and then synthesize each sample individually in the laboratory. However, often orders of magnitude more sequences can be experimentally assayed than can affordably be synthesized individually. In this article, we propose instead to use stochastic synthesis methods, such as mixed nucleotides or trimers. We describe a black-box algorithm for optimizing stochastic synthesis protocols to produce approximate samples from any target generative model. We establish theoretical bounds on the method's performance, and validate it in simulation using held-out sequence-to-function predictors trained on real experimental data. We show that using optimized stochastic synthesis protocols in place of individual synthesis can increase the number of hits in protein engineering efforts by orders of magnitude, e.g. from zero to a thousand.

11

Geometric Multidimensional Representation of Omic Signatures

Almeida Cordeiro Nogueira, H.; Medina-Acosta, E.

2026-01-28 bioinformatics 10.64898/2026.01.26.701791 medRxiv

Top 0.1%

34.4%

Multi-omic signatures are widely used in biomarker discovery, precision oncology, and systems biology, yet they are typically treated as vectors or composite scores that collapse intrinsically multidimensional biological organization into one-dimensional summaries. As a result, their internal structure, contextual dependencies, and mechanistic coherence remain largely inaccessible. Here, we introduce a geometric framework that reconceptualizes omic signatures as multidimensional informational entities whose biological meaning arises from structural organization rather than molecular membership alone. Each signature is embedded in a shared latent space integrating regulatory, phenotypic, microenvironmental, immune, and clinical constraints, and represented as a convex polytope. This representation preserves internal organization and enables intrinsic geometric measurements--including barycenter distance, volume, anisotropy, and asymmetry--that quantify concordance, divergence, and latent complexity. We apply this framework to 24,796 metabolic regulatory circuitries reconstructed across 32 TCGA cancer types, encoded as paired regulatory and metabolic signatures in an 18-dimensional latent space. Geometric analysis shows that discordance predominates: most circuitries occupy strong or extreme discordance regimes and display high-dimensional, frequently asymmetric geometries, whereas fully concordant circuitries are rare and structurally constrained. These geometric phenotypes stratify metabolic pathways and superfamilies in reproducible, non-uniform patterns that are not detectable with vector- or network-based representations. By transforming omic signatures into measurable geometric objects, this framework enables principled comparison, de-redundancy, and mechanistic interpretation of multi-omic biomarkers, providing a scalable approach for analyzing complex regulatory systems across cancer and beyond. 
All geometric representations and derived descriptors are available through the SigPolytope Shiny application (https://sigpolytope.shinyapps.io/geometricatlas/).

12

Paired evaluation defines performance landscapes for machine learning models

Nariya, M. K.; Mills, C. E.; Sorger, P. K.; Sokolov, A.

2022-09-12 bioinformatics 10.1101/2022.09.07.507020 medRxiv

Top 0.1%

34.0%

The true accuracy of a machine learning model is a population-level statistic that cannot be observed directly. In practice, predictor performance is estimated against one or more test datasets, and the accuracy of this estimate strongly depends on how well the test sets represent all possible unseen datasets. Here we present paired evaluation, a simple approach for increasing the robustness of performance evaluation by systematic pairing of test samples, and use it to evaluate predictors of drug response in breast cancer cell lines and of disease severity in patients with Alzheimer's disease. Our results demonstrate that the choice of test data can cause estimates of performance to vary by as much as 30%, and that paired evaluation makes it possible to identify outliers, improve the accuracy of performance estimates in the presence of known confounders, and assign statistical significance when comparing machine learning models.
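
The abstract does not spell out how test samples are paired, so the following is only a generic illustration of scoring a predictor over pairs and not the authors' procedure. It computes the fraction of (positive, negative) test pairs that a model ranks correctly, a quantity that equals the area under the ROC curve when ties count as half. The function name and the 0/1 label encoding are assumptions made for this sketch.

```python
def pairwise_ranking_accuracy(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    `scores` are model outputs and `labels` are 0/1 class labels.
    Ties contribute 0.5, which makes the result equal to ROC AUC.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    print(pairwise_ranking_accuracy([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because every pair contributes its own outcome, the per-pair results can be inspected individually. That is one plausible route to the outlier identification and confounder-aware comparisons the abstract mentions, for example by restricting the comparison to pairs that share a known confounder value.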

13

Perturbation Curve models continuous transcriptional response trajectories and improves prediction of genetic modulations

Zhong, Y.; Wang, L.; Yang, G.; Yu, L.; Qi, X.; Jiang, H.

2026-06-19 bioinformatics 10.64898/2026.06.16.732192 medRxiv

Top 0.1%

33.8%

Single-cell CRISPR screens, Perturb-seq, have revolutionized functional genomics by revealing biological causality. However, although perturbation assignments are typically represented as discrete labels, the cell-level effective strength of perturbations is often continuous and diverse. Current analytical frameworks struggle to decouple the variability in perturbation strength from the diversity of downstream responses. Here, we present Perturbation Curve (PertCurve), a nonlinear, curve-based computational framework that models the trajectories of transcriptomic responses by explicitly incorporating diverse perturbation magnitudes and strengths. By ordering cells by perturbation strength, we demonstrate that PertCurve accurately recapitulates the response magnitudes and reveals the distinct modularity and asynchrony patterns of downstream gene behaviors. These patterns are categorized into archetypes, including proportional, sensitive, and threshold responses. By applying this framework across CRISPRi/a modalities, we identify universal response patterns in viral infection, apoptosis, and proliferation genes, and reveal previously overlooked context-specific regulatory features in cell differentiation. Finally, incorporating PertCurve into perturbation prediction models and evaluation metrics enhances predictive performance, delivering actionable insights for refining established models.

14

Single-cell hit calling in high-content imaging screens with Buscar

Serrano, E.; Li, W.-s.; Way, G. P.

2026-04-19 bioinformatics 10.64898/2026.04.15.718737 medRxiv

Top 0.1%

33.7%

High-content screening (HCS) enables the systematic quantification of single-cell morphology features across thousands of perturbations, capturing rich phenotypic heterogeneity. Image-based profiling is a critical bioinformatics processing step in this pipeline, as researchers use it to predict mechanisms of action, assess toxicity, perform hit calling, and more. However, current image-based profiling workflows rely on aggregate statistics, such as calculating mean or median feature values per well, implicitly assuming cell homogeneity. This limitation obscures subpopulation effects, reducing sensitivity to subtle or heterogeneous effects of perturbations. Here we present Buscar, a method that leverages the full heterogeneity of single-cell image-based profiles to call hits. Buscar requires two reference single-cell populations that define distinct morphology states: a reference state (e.g., disease cells) and a target state (e.g., healthy cells). Buscar then compares these two groups to define on- and off-morphology signatures, which it then uses to score every perturbation in a given screen. The scores quantify perturbation efficacy and off-target effects, or specificity, in an interpretable manner, clarifying which morphologies are appropriately altered and which may arise from off-target activity. We apply Buscar to three datasets. First, as a proof of concept, we applied Buscar to a Cell Painting dataset of cardiac fibroblasts from patients with heart failure. Buscar quantifies both morphology rescue and off-target morphology activity in these cells treated with a TGF-β receptor inhibitor. Second, we show that Buscar recovers biologically coherent gene-phenotype associations across 16 manually-labeled phenotypes in the MitoCheck dataset. Lastly, applied to CPJUMP1, we show that Buscar is robust to technical replicates collected across plates in both small-molecule and CRISPR-Cas9 perturbations.
Together, these results establish Buscar as a reproducible and interpretable hit calling method that overcomes aggregation bias, enabling the simultaneous quantification of compound efficacy and specificity to enhance hit calling in HCS. We release Buscar as an open-source Python package.

15

Cell-type-agnostic differential gene expression uncovers conserved principles of cellular regulation

Hummels, A. R.; Camacho, C. J.

2025-10-01 systems biology 10.1101/2025.09.29.679285 medRxiv

Top 0.1%

33.6%

Cellular responses to perturbagens--including pharmacological compounds and genetic manipulations--remain incompletely characterized, as conventional approaches are constrained by cell-type-specific biases. Here, we derive consensus signatures (CS) that uncover conserved principles not of gene expression, but of fundamental cellular regulation. Quantified by aggregating differential gene expression responses across diverse cell types, time points, and doses, these CS retain the core features of individual experiments while preserving biological relevance. As proof of concept, genome-wide CS screening identified three knockdown-resistant genes--CCNA1, ORC1, and SOX2--that also engage in their reciprocal up-regulation. Expression of this network is modulated by the factor E2F2, revealing a mechanism by which proliferative signaling persists despite diverse genetic and chemical perturbations. Crucially, we identify that ORC1 down-regulation by specific CDK4/6 inhibitors singularly disrupts this feedback loop, providing a direct route to suppress E2F2-driven proliferation. More generally, we demonstrate for the first time that CS can accurately predict RNA-seq responses in novel cell lines, uncovering evolutionarily conserved mechanisms that regulate fundamental biological processes beyond context-specific variability.

16

The zoo of the gene networks capable of pattern formation by extracellular signaling

Anhon, K. M.; Ciudad, I. S.

2025-05-07 systems biology 10.1101/2025.05.06.652477 medRxiv

Top 0.1%

33.6%

A fundamental question of developmental biology is pattern formation, or how cells with specific gene expression end up in specific locations in the body to form tissues, organs and, overall, functional anatomy. Pattern formation involves communication through extracellular signals and complex intracellular gene networks integrating these signals to determine cell responses (e.g., further signaling, cell division, cell differentiation, etc.). In this article we ask: 1) Are there any logical or mathematical principles determining which gene network topologies can lead to pattern formation by cell signaling over space in multicellular systems? 2) Can gene network topologies be classified into a small number of classes that entail similar dynamics and pattern transformation capacities? In this article we combine logical arguments and mathematical proofs to show that, despite the large number of theoretically possible gene network topologies, all gene network topologies necessary for pattern formation fall into just three fundamental classes and their combinations. We show that gene networks within each class share the same logic on how they lead to pattern formation and hence lead to similar patterns. We characterize the main features of each class and discuss how they constitute an exhaustive zoo of pattern-forming gene networks. This zoo includes all gene networks that, to our knowledge, are experimentally known to lead to pattern formation as well as other gene networks that have not yet been found experimentally.
Significance Statement: A fundamental question of developmental biology is pattern formation. In this article we ask: 1) Are there any logical or mathematical principles determining which gene network topologies can lead to pattern formation by cell signaling over space in multicellular systems? 2) Can gene network topologies be classified into a small number of classes that entail similar dynamics and pattern transformation capacities? We show that, despite the large number of theoretically possible gene network topologies, all gene network topologies necessary for pattern formation fall into just three fundamental classes and their combinations. We show that gene networks within each class share the same logic on how they lead to pattern formation and hence lead to similar patterns.

17

OmniPert: A Deep Learning Foundation Model for Predicting Responses to Genetic and Chemical Perturbations in Single Cancer Cells

Taj, F.; Stein, L. D.

2025-07-05 bioinformatics 10.1101/2025.07.02.662744 medRxiv

Top 0.1%

33.5%

In cancer, intra- and inter-patient heterogeneity presents a significant challenge for therapeutic management, as patients with apparently similar profiles often exhibit divergent responses to the same therapies. This heterogeneity is primarily attributed to genetic and molecular variations among individuals and their tumors. Understanding the impact of these differences on treatment outcomes is widely believed to be a key step for developing effective precision medicine strategies. However, the complexity of most biological pathways makes it difficult to predict the effect of genetic variation on cells and tissues, let alone predict a patient's response to therapy. As a result, high-throughput genetic and chemical perturbation screens have emerged as valuable tools for precision medicine-related tasks, such as disease modeling, target discovery, cellular programming, and pathway reconstruction. This approach is fundamentally limited, however, because the number of possible combinations of cell types, cell states, perturbation targets, and perturbation types is huge and cannot be exhaustively tested experimentally. This calls for computational approaches that can simulate such experiments in silico, guiding in vitro experiments towards perturbations that are more likely to produce the desired effect. Here we describe OmniPert, a novel generative AI tool, which utilizes a deep learning, transformer-based architecture to model the effects of genetic and chemical perturbations on single-cell transcriptomes. Trained on millions of diverse cellular profiles, this approach allows for more granular analysis of cellular responses, thereby facilitating downstream applications in cell-specific gene-gene and gene-drug interaction networks, biomarker and drug target discovery, drug repurposing, and in silico perturbation reverse-engineering.
In the context of oncology, OmniPert promises to facilitate the discovery of novel cell type- and state-specific targets, ultimately contributing to more effective and personalized cancer treatments.

18

Harnessing AI to Build Virtual Cells

Cheng, X.; Li, P.; Guo, H.; Liang, Y.; Gong, J.; de Vazelhes, W.; Gou, C.; Xie, P.; Song, L.; Xing, E. P.

2026-04-30 bioinformatics 10.64898/2026.04.11.717183 medRxiv

Top 0.1%

32.8%

A virtual cell is a world model of a cell: a computational system that predicts, simulates and programs cellular processes across modalities and scales. An important path toward this goal is to model how genetic and chemical perturbations give rise to transcriptional responses, a core capability for disease understanding and drug discovery. However, current approaches remain expert-intensive, relying on iterative manual model design, training and debugging over months. Here we present VCHarness, an autonomous AI system that constructs perturbation-response models by combining an AI coding agent with multimodal biological foundation models. The system explores large spaces of architectures and training pipelines with minimal human intervention, iteratively generating, evaluating and refining candidate models. Across multiple perturbation-response benchmarks, VCHarness identifies architectures that outperform expert-designed approaches while reducing development time from months to days. It further uncovers non-obvious architectural patterns associated with improved performance, indicating that automated search can extend beyond conventional design strategies. These results suggest a shift from manually engineered models toward autonomous systems for constructing components of virtual cell world models, enabling scalable and data-driven exploration of cellular systems.

19

Limits to the inference of gene regulation from bulk tissue expression data

Chu, C. P.; Morin, A. A.; Pavlidis, P.

2025-02-25 bioinformatics 10.1101/2024.10.24.619521 medRxiv

Top 0.1%

32.2%

Motivation: Thousands of studies have used co-expression analysis of bulk tissue samples to probe gene regulation. However, the extent to which intracellular regulatory signals are present in these data is unclear. Specifically, we lack clarity on the factors that promote or impede the propagation of intracellular regulatory signals from the single cell level to the bulk tissue level. To bring these issues into focus, we developed a novel computational simulator, grounded in real data, to explore the theoretical relationship between events in single cells and bulk tissue expression profiles, and clarify the conditions required for the propagation of intracellular regulatory signals in complex tissues such as the brain. Results: Our simulator first generates single cell expression profiles and subsequently samples and aggregates these single cells to produce bulk tissue expression profiles. Using this framework, we found that there are very specific and unlikely conditions under which intracellular dynamic regulatory signals can be propagated to the bulk tissue level. For the most part, such regulatory relationships, however strong at the single cell level, are unlikely to be detectable. Our results provide a quantitative explanation for why regulatory network inference from co-expression has proved challenging - even with the assistance of other data modalities - and give the scientific community a set of tools to further explore these issues in both single-cell and bulk tissue data. Availability and implementation: All relevant data are within the manuscript and supplementary files. The code for all data analyses and generation of figures is available on GitHub (https://github.com/PavlidisLab/coex-simulation). A copy of the data has been deposited in Borealis, the Canadian Dataverse Repository (https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP3/2CWXY6).

20

Predicting cellular responses to perturbation across diverse contexts with STATE

Adduri, A. K.; Gautam, D.; Bevilacqua, B.; Imran, A.; Shah, R.; Naghipourfar, M.; Teyssier, N.; Ilango, R.; Nagaraj, S.; Ricci-Tam, C.; Carpenter, C.; Subramanyam, V.; Winters, A.; Dong, M.; Tirukkovalur, S.; Sullivan, J.; Plosky, B. S.; Eraslan, B.; Youngblut, N. D.; Leskovec, J.; Gilbert, L. A.; Konermann, S.; Hsu, P. D.; Dobin, A.; Burke, D. P.; Goodarzi, H.; Roohani, Y. H.

2025-07-10 systems biology 10.1101/2025.06.26.661135 medRxiv

Top 0.1%

32.1%

Cellular responses to perturbations are a cornerstone for understanding biological mechanisms and selecting drug targets. While machine learning models offer tremendous potential for predicting perturbation effects, they currently struggle to generalize to unobserved cellular contexts. Here, we introduce STATE, a transformer model that predicts perturbation effects while accounting for cellular heterogeneity within and across experiments. STATE predicts perturbation effects across sets of cells and is trained using gene expression data from over 100 million perturbed cells. STATE improved discrimination of effects on large datasets by more than 30% and identified differentially expressed genes across genetic, signaling and chemical perturbations with significantly improved accuracy. Using its cell embedding trained on observational data from 167 million cells, STATE identified strong perturbations in novel cellular contexts where no perturbations were observed during training. We further introduce Cell-Eval, a comprehensive evaluation framework that highlights STATE's ability to detect cell type-specific perturbation responses, such as cell survival. Overall, the performance and flexibility of STATE set the stage for scaling the development of virtual cell models.